I picked this course from the University of Helsinki course catalogue.
I am really excited to learn R because I want new tools to help me process data. I usually use Excel, and as a software engineer it would be better to use my programming skills to their full potential.
The first exercises were really basic, but I found them interesting, because the point of learning this kind of programming language is to get things done quickly.
GitHub: <https://github.com/plipplopplipplop/IODS-project>
GitHub Pages: <https://plipplopplipplop.github.io/IODS-project>
```r
library(GGally)
```

## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method   from
##   +.gg     ggplot2

```r
library(ggplot2)
```
```r
rd <- read.table(".\\data\\learning2014.txt",
                 as.is = TRUE,
                 sep = "\t",
                 header = TRUE)
str(rd)
```
## 'data.frame': 183 obs. of 71 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ d_sm : int 15 13 15 13 16 19 17 15 19 16 ...
## $ d_ri : int 15 10 16 15 14 20 17 13 17 16 ...
## $ d_ue : int 13 12 11 14 14 18 12 11 16 16 ...
## $ deep : int 43 35 42 42 44 57 46 39 52 48 ...
## $ st_os : int 14 14 13 12 16 15 12 15 16 16 ...
## $ st_tm : int 13 8 16 13 13 14 6 17 18 12 ...
## $ stra : int 27 22 29 25 29 29 18 32 34 28 ...
## $ su_lp : int 10 9 7 7 8 8 5 10 7 10 ...
## $ su_um : int 11 12 8 11 12 10 5 11 6 12 ...
## $ su_sb : int 10 17 12 9 14 11 13 13 13 14 ...
## $ surf : int 31 38 27 27 34 29 23 34 26 36 ...
The data set was collected as part of an international research project on teaching and learning. A total of 183 participants answered a questionnaire about how they study during a course: do they take a deep approach, studying hard and focusing on understanding the details; a strategic approach, planning their studies and working evenly throughout the course; or a surface approach, doing just enough to get the course done and move on to the next topic early.

There are 32 question items (marked STxx, SUxx, and Dxx in the data set) that measure three internal dimensions (deep, surface, and strategic). The items in each set are first combined into sub-scales that capture a higher-level idea behind each approach (d_xx, st_xx, and su_xx), and these sub-scales are then combined into an overall score for each approach (the variables deep, stra, and surf).

Additional questions were also asked in the questionnaire (measured as Aa-Af, Ca-Ch, and Da-Dj), and basic background information about each respondent is stored in the data set as well (age, gender, attitude, and total points).
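The scoring can be illustrated with a short sketch. The item-to-dimension mapping below is inferred from the str() output above (the 12 Dxx items sum to the first values of rd$deep), so treat it as an illustration rather than the official scoring key:

```r
# Hypothetical sketch: reconstructing the overall deep-approach score by
# summing the 12 deep-learning items (mapping inferred from str(rd) above).
deep_items <- c("D03", "D06", "D07", "D11", "D14", "D15",
                "D19", "D22", "D23", "D27", "D30", "D31")

# The first respondent's answers, copied from the str() output
first <- data.frame(D03 = 4, D06 = 4, D07 = 4, D11 = 3, D14 = 4, D15 = 3,
                    D19 = 4, D22 = 3, D23 = 2, D27 = 4, D30 = 4, D31 = 4)

rowSums(first[, deep_items])   # 43 = the first value of rd$deep
```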
```r
# columns of rd: 57 = age, 58 = attitude, 59 = points,
#                64 = deep, 67 = stra, 71 = surf
myplot <- ggpairs(rd,
                  columns = c(57, 58, 59, 64, 67, 71),
                  mapping = ggplot2::aes(col = gender, alpha = 0.3),
                  lower = list(combo = wrap("facethist", bins = 20)))
myplot
```
The charts are colour-coded by gender (red being female). From them the reader can see the distribution of the questionnaire scores by gender, together with age, attitude towards statistics, and exam points.
The students were quite young, and attitude was slightly higher among males. Overall points were quite good, and the deep approach to studying was the most common one. The strategic approach was used in combination with deep learning, while the surface approach was less common.
```r
diagplots <- lm(points ~ surf + stra + deep, data = rd)
summary(diagplots)
```
##
## Call:
## lm(formula = points ~ surf + stra + deep, data = rd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.222 -2.516 1.750 5.531 13.291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.12137 6.81477 3.980 0.00010 ***
## surf -0.17294 0.10303 -1.678 0.09500 .
## stra 0.28027 0.10246 2.735 0.00686 **
## deep -0.17222 0.09888 -1.742 0.08326 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.46 on 179 degrees of freedom
## Multiple R-squared: 0.06432, Adjusted R-squared: 0.04864
## F-statistic: 4.101 on 3 and 179 DF, p-value: 0.007617
I chose three explanatory variables: deep, surf, and stra. I chose them because they are the combination scores of the question sets, and these three variables should have a relationship with the total exam points.

Residuals are the differences between the observed response values and the values the model predicts; in this case, how far each student's points fall from the fitted line (the largest miss here is about 22 points).

The Estimate column gives the model coefficients: the intercept and the change in points associated with a one-unit change in each approach score. Std. Error is the uncertainty of each estimate, the t value is the estimate divided by its standard error, and Pr(>|t|) is the p-value, i.e. the probability of seeing a t value at least this large if the true coefficient were zero.

Multiple R-squared is the share of the variance in points that the model explains; Adjusted R-squared corrects it for the number of explanatory variables. Here the model explains only about 6% of the variance, so the fit to the actual data is in fact rather weak.
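As a side note on where these figures live: summary.lm() stores both R-squared values as list elements. A minimal sketch on the built-in mtcars data (used here only because learning2014.txt is not bundled with R):

```r
# Where summary.lm() keeps the two R-squared values
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)
s$r.squared       # share of variance in mpg explained by the model
s$adj.r.squared   # the same, penalized for the number of predictors
```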
```r
par(mfrow = c(2, 2))
plot(diagplots, which = c(1, 2, 5))
```
To be honest, I don't yet have a clue how to interpret these diagnostic plots.
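For future reference, plot.lm() numbers its diagnostic plots, so which = c(1, 2, 5) selects Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage. A minimal sketch on built-in data:

```r
# which = 1: Residuals vs Fitted   -- checks the constant-variance assumption
# which = 2: Normal Q-Q            -- checks that residuals are roughly normal
# which = 5: Residuals vs Leverage -- flags unusually influential observations
fit <- lm(mpg ~ wt, data = mtcars)
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 5))
```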
```r
date()
```
## [1] "Mon Nov 29 04:29:05 2021"
```r
library(ggplot2)
library(dplyr)
```
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```r
library(boot)

url <- "https://github.com/rsund/IODS-project/raw/master/data"
url_alc <- paste(url, "alc.csv", sep = "/")
alc <- read.csv(url_alc, header = TRUE)
dim(alc)
```
## [1] 370 51
```r
colnames(alc)
```
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
This joined data set contains information about alcohol consumption among students in two Portuguese schools. The data were collected using school reports and questionnaires. The joined set contains 370 observations of 51 variables.
I chose the following four variables for my personal hypothesis on alcohol consumption: 1. romantic 2. higher 3. internet 4. schoolsup.
My hypothesis is that being in a romantic relationship and wanting to take higher education decrease alcohol consumption, while lacking support from school and having no internet connection at home (and therefore needing to stay longer on campus) increase it.
```r
phPlot <- ggplot(alc, aes(x = romantic, y = alc_use, col = sex))
phPlot + geom_boxplot() +
  ylab("Average alcohol consumption") +
  xlab("With a romantic relationship")
```
As the box plot shows, alcohol consumption did not drop much for students in a romantic relationship. Among females consumption dropped slightly, while among the lighter-drinking males it slightly increased.
```r
phPlot <- ggplot(alc, aes(x = higher, y = alc_use, col = sex))
phPlot + geom_boxplot() +
  ylab("Average alcohol consumption") +
  xlab("Wants to take higher education")
```
As assumed, wanting to take higher education was associated with lower alcohol consumption among males. Among females, however, those who want to take higher education consume more alcohol, which is the exact opposite of my assumption.
```r
phPlot <- ggplot(alc, aes(x = internet, y = alc_use, col = sex))
phPlot + geom_boxplot() +
  ylab("Average alcohol consumption") +
  xlab("Internet access at home")
```
Here is the second surprising result: internet access goes together with higher alcohol consumption among females. Among males the median stayed the same, but consumption became more dispersed at both the upper and lower ends.
```r
phPlot1 <- ggplot(alc, aes(x = schoolsup, y = alc_use, col = sex))
phPlot1 + geom_boxplot() +
  ylab("Average alcohol consumption") +
  xlab("Extra educational support")
```
This was the only assumption that held: students who receive extra educational support consume less alcohol. The drop was clear for both genders, just as I assumed.
```r
m <- glm(high_use ~ romantic + higher + internet + schoolsup,
         data = alc, family = "binomial")
summary(m)
```
##
## Call:
## glm(formula = high_use ~ romantic + higher + internet + schoolsup,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2857 -0.8807 -0.7814 1.5065 1.7737
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2511 0.5980 0.420 0.675
## romanticyes -0.2831 0.2547 -1.112 0.266
## higheryes -1.2242 0.5263 -2.326 0.020 *
## internetyes 0.2261 0.3289 0.688 0.492
## schoolsupyes -0.3105 0.3574 -0.869 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 444.76 on 365 degrees of freedom
## AIC: 454.76
##
## Number of Fisher Scoring iterations: 4
From the summary we can see that higher is the only variable whose p-value is below 5%, so it is the only one that can be considered significant. This confirms what the box plots already suggested: the only clearly meaningful variable in my personal hypothesis was indeed higher. The deviance residuals also show that high alcohol use is the less common outcome among these 15-22-year-old students, so the residuals lean towards the minimum side, with some variation between the minimum and maximum.
```r
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
```

## Waiting for profiling to be done...

```r
cbind(OR, CI)
```
## OR 2.5 % 97.5 %
## (Intercept) 1.2854900 0.3969561 4.2627835
## romanticyes 0.7534579 0.4527659 1.2317868
## higheryes 0.2939864 0.1007957 0.8226494
## internetyes 1.2537265 0.6707657 2.4542330
## schoolsupyes 0.7330743 0.3506197 1.4390072
As the odds ratios show, higher is the only variable whose ratio deviates clearly from one (OR about 0.29), and its 95% confidence interval stays entirely below one. For the other variables the confidence intervals are wide and include one, so their effects cannot be distinguished from no effect at all.
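The link between the glm coefficients and these odds ratios is simply exponentiation: an odds ratio is exp() of the log-odds coefficient.

```r
# The OR column above is exp() of the Estimate column from summary(m):
# the coefficient -1.2242 for higheryes becomes an odds ratio of ~0.29,
# i.e. wanting higher education cuts the odds of high alcohol use to
# roughly 29% of the baseline.
log_odds <- -1.2242   # higheryes estimate from summary(m)
exp(log_odds)         # ~0.294, matching the OR table
```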
```r
prob <- predict(m, type = "response")
alc <- mutate(alc, probability = prob)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
```
## prediction
## high_use FALSE TRUE
## FALSE 252 7
## TRUE 103 8
```r
table(high_use = alc$high_use, prediction = alc$prediction) %>%
  prop.table() %>%
  addmargins()
```
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68108108 0.01891892 0.70000000
## TRUE 0.27837838 0.02162162 0.30000000
## Sum 0.95945946 0.04054054 1.00000000
```r
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
```
## [1] 0.2972973
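A tiny check of what loss_func measures: a prediction counts as wrong whenever the fitted probability falls on the opposite side of 0.5 from the actual class (coded 0/1, or FALSE/TRUE as with high_use).

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# two of these four toy predictions fall on the wrong side of 0.5
loss_func(class = c(0, 0, 1, 1), prob = c(0.2, 0.8, 0.9, 0.1))   # 0.5
```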
As the table shows, the prediction works well when the actual value of high_use is FALSE (252 of 259 correct), but when it is TRUE the model catches only 8 of the 111 cases, so there it is not accurate at all.
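This can be quantified from the cross-tabulation above (counts copied from the printed table; rows are the actual high_use values, columns the predictions):

```r
cm <- matrix(c(252, 103, 7, 8), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy    <- sum(diag(cm)) / sum(cm)                   # (252 + 8) / 370, ~0.70
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ]) # 252 / 259, ~0.97
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # 8 / 111, ~0.07
c(accuracy = accuracy, specificity = specificity, sensitivity = sensitivity)
```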
```r
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```
## [1] 0.3
I used the same approach as in the DataCamp exercises. The 10-fold cross-validation gives a prediction error of about 0.30, which is essentially the same as the training error of roughly 0.297 computed above, so the model generalizes about as well (or as poorly) as it fits the training data.
```r
date()
```
## [1] "Mon Nov 29 04:29:07 2021"
```r
library(MASS)
```
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
```r
library(plotly)
```
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
```r
data("Boston")
dim(Boston)
```
## [1] 506 14
```r
colnames(Boston)
```
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The Boston data frame is a sample data set from the R package MASS. It has 506 rows and 14 columns; a summary of each column follows:
```r
summary(Boston)
```
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
```r
par(mfrow = c(7, 2), fin = c(4, 4), mar = c(2, 5, 2, 1), las = 1, bty = "n")
plot(Boston$crim, ylab = "per capita", main = "Crime rate")
plot(Boston$zn, ylab = "lots over 25,000 sq.ft", main = "Land zoned for lots")
plot(Boston$indus, ylab = "acres per town", main = "Non-retail business acres")
plot(Boston$medv, ylab = "value of homes", main = "Owner-occupied homes")
plot(Boston$nox, ylab = "parts per 10 million", main = "NOX concentration")
plot(Boston$rm, ylab = "number of rooms", main = "Rooms per dwelling")
plot(Boston$age, ylab = "built prior to 1940", main = "Owner-occupied units built")
plot(Boston$dis, ylab = "weighted mean", main = "Distances to employment centres")
plot(Boston$rad, ylab = "index", main = "Accessibility to highways")
plot(Boston$tax, ylab = "tax rate per 10,000", main = "Property-tax rate")
plot(Boston$ptratio, ylab = "ratio", main = "Pupil-teacher")
plot(Boston$black, ylab = "1000(Bk - 0.63)^2", main = "Proportion of blacks")
plot(Boston$lstat, ylab = "percent", main = "Lower status")
```
Comparing the other variables to the crime rate: higher crime rates seem to occur in areas with a higher property-tax rate and lower owner-occupied home values, where the owners do not live in the same area. A higher NOX concentration (traffic pollution) and good access to highways also go together with higher crime rates, while employment centres are far away. On the other hand, the crime rate does not seem related to the black-population variable or the number of rooms per dwelling.
```r
# Make standardized dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
```
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
```r
class(boston_scaled)
```

## [1] "matrix" "array"

```r
# scale() returns a matrix, so convert back to a data frame
boston_scaled <- as.data.frame(boston_scaled)

# Create a categorical crime-rate variable using the quantiles as cut points
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             label = c("low", "med_low", "med_high", "high"))

# Drop the continuous crime rate and add the new categorical variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# Divide the dataset into train (a random 80% of rows) and test sets
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
dim(test)
```

## [1] 102 14

```r
dim(train)
```

## [1] 404 14
Comparing the two summaries shows what scaling does: each variable is centred to mean 0 and divided by its standard deviation. Variables that could not originally be negative (such as crim) now take negative values, because every value below the variable's mean becomes negative after centring.
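What scale() does can be checked on a single small vector: it subtracts the mean and divides by the standard deviation, so every scaled variable ends up with mean 0 and sd 1, and originally positive values can well become negative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # strictly positive values
z <- as.vector(scale(x))         # (x - mean(x)) / sd(x)
mean(z)   # 0 (up to floating point)
sd(z)     # 1
min(z)    # negative, even though x itself never was
```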
```r
lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes,
     xlab = "Crime Rate", ylab = "Other factors")
lda.arrows(lda.fit, myscale = 2)
```
From this plot we can see that accessibility to radial highways (rad) affects the crime-rate classification the most. Nitrogen oxide concentration (nox) also shows a clear link to increasing crime rates. In contrast, when the proportion of residential land zoned for large lots (zn) increases, the crime rates are lower.
```r
crimeCat <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = crimeCat, predicted = lda.pred$class)
```
## predicted
## correct low med_low med_high high
## low 21 8 3 0
## med_low 3 20 5 0
## med_high 0 7 8 0
## high 0 0 1 26
The high crime-rate class is predicted quite consistently, while there is more confusion between the low, med_low, and med_high classes.
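Per-class accuracy, computed from the printed cross-tabulation (rows are the correct classes, columns the LDA predictions), backs this up:

```r
tbl <- matrix(c(21,  3, 0,  0,
                 8, 20, 7,  0,
                 3,  5, 8,  1,
                 0,  0, 0, 26),
              nrow = 4,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
diag(tbl) / rowSums(tbl)   # high ~0.96 is clearly the best; med_high only ~0.53
```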
```r
data("Boston")
Boston <- scale(Boston)
km <- kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)
```
I ended up with four clusters. Looking at the pairs plots, the clusters are not too dispersed and every chart shows large, fairly distinct centres. With five clusters, one cluster was either too small or too dispersed; with three, the clusters were even, but four gives four reasonably even clusters as well while carrying more information.
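A more systematic companion to eyeballing the pairs plots: compute the total within-cluster sum of squares (WCSS) for k = 1..10 and look for the "elbow" where adding more clusters stops paying off.

```r
library(MASS)

data("Boston")
boston_std <- scale(Boston)

set.seed(123)   # kmeans uses random starting centers
twcss <- sapply(1:10, function(k) kmeans(boston_std, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b",
     xlab = "number of clusters k", ylab = "total WCSS")
```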
```r
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
```

## [1] 404 13

```r
dim(lda.fit$scaling)
```

## [1] 13 3

```r
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = km$centers)
```
Both plots show the same points in the (LD1, LD2, LD3) space; in the first, the markers are coloured by the crime classes of the train set, which makes the classes easy to tell apart. (Note that km above was fitted on the full scaled Boston set of 506 rows, so its clustering does not map one-to-one onto these 404 train-set points; colouring by clusters computed on the train set itself would be the proper k-means counterpart.)